Landmark tracking in liver US images using cascade convolutional neural networks with long short-term memory

Abstract Accurate tracking of anatomic landmarks is critical for motion management in liver radiation therapy. Ultrasound (US) is a safe, low-cost technology that is broadly available and offer real-time imaging capability. This study proposed a deep learning-based tracking method for the US image-guided radiation therapy. The proposed cascade deep learning model is composed of an attention network, a mask region-based convolutional neural network (mask R-CNN), and a long short-term memory (LSTM) network. The attention network learns a mapping from an US image to a suspected area of landmark motion in order to reduce the search region. The mask R-CNN then produces multiple region-of-interest proposals in the reduced region and identifies the proposed landmark via three network heads: bounding box regression, proposal classification, and landmark segmentation. The LSTM network models the temporal relationship among the successive image frames for bounding box regression and proposal classification. To consolidate the final proposal, a selection method is designed according to the similarities between sequential frames. The proposed method was tested on the liver US tracking datasets used in the medical image computing and computer assisted interventions 2015 challenges, where the landmarks were annotated by three experienced observers to obtain their mean positions. Five-fold cross validation on the 24 given US sequences with ground truths shows that the mean tracking error for all landmarks is 0.65 ± 0.56 mm, and the errors of all landmarks are within 2 mm. We further tested the proposed model on 69 landmarks from the testing dataset that have the similar image pattern with the training pattern, resulting in a mean tracking error of 0.94 ± 0.83 mm. The proposed deep-learning model was implemented on a graphics processing unit (GPU), tracking 47–81 frames s−1. Our experimental results have demonstrated the feasibility and accuracy of our proposed method in tracking liver anatomic landmarks using US images, providing a potential solution for real-time liver tracking for active motion management during radiation therapy.


Introduction
Accurate delivery of radiation dose to the intended treatment target is critical to the safety and efficacy of radiation therapy, especially in body sites where physiologic motion (respiration, sneezing etc) may cause significant short-term variability in anatomic position [1]. Any inaccuracy may lead to insufficient dose to the tumor target, geographic miss, or overdose of surrounding normal tissues. Many treatment protocols have attempted to reduce the impact of respiratory motion using breath-hold techniques. However, breath-hold substantially prolongs treatment time and may not be welltolerated by all patients [2]. Real-time motion tracking enables advanced motion management during treatment delivery to improve treatment safety and efficacy while benefitting more patients.
Two-dimensional x-ray imaging is commonly used to assist motion tracing, but it often requires implanting several fiducial markers into the moving organs to facilitate tracking [3]. Fiducial placement is invasive and carries with it a risk of side effects similar to those associated with other same-day operating room procedures. Recently, a few non-invasive tracking methods have been developed to track anatomic landmarks within moving organs with promising results [4,5]. Because it is non-invasive and low-cost while providing high soft tissue contrast in real time without additional radiation dose, ultrasound (US) imaging is an excellent candidate for real-time imaging for motion tracking during radiation therapy [6,7]. The automatic localization of landmarks potentially reduces the physician's cognition task and the manual error. However, US often suffers from low signal-to-noise ratio and imaging artifacts, making the motion tracking task on US images very challenging.
This landmark tracking problem is usually addressed by exploiting the relationship between the current image frame and the preceding frame in the US image sequence. Nouri and Rothberg trained a neural network by minimizing the Euclid distances between image patches containing the same landmark in an embedding subspace and then identified the target image patch with the shortest distance to the prior frame with a search window on the landmark [8]. Makhinya and Goksel extended the algorithm for superficial vein tracking using elliptical and template image sequences of the liver, followed by an optic-flow framework [9]. Hallack et al combined Log-Demons nonlinear registration to estimate motion with a moving-window tracking method to propagate motion around the region of interest (ROI) to subsequent frames [10]. Kondo proposed two extensions to the kernelized correlation filter using an adaptive window size selection and motion refinement with template matching [11]. Chen et al predicted the motion of anatomic targets in liver US sequences by a line regression-based ensemble of six machine learning models [12]. Ozkan et al proposed a supporter model to capture the coupling of motion between the image features and target so as to predict target position [5]. For 3D point-landmark tracking, Banerjee et al [13] proposed a 4D US tracking method based on global and local rigid registration schemes, while Royer et al [14] combined visual motion estimation with a mechanical model of the target. Williamson et al utilized a combination of template matching, dense optical flow and image intensity information for US target tracking in real-time [15]. However, the above methods often suffer from abrupt motions due to sneezing, coughing, etc. To handle the tracking drift issues caused by abrupt motions, Teo et al adopted a weighted optical flow algorithm to reduce the tracking drift of an uncontoured tumor [16], while O'shea et al [17] used the α−β filter/similarity threshold for image-guided radiation therapy. In addition, Harris et al [18] and Bell et al [19] conducted the study of in vivo liver tracking using 4D US. Van et al [20,21] developed deep learning-based methods to automatically detect and localize the B-lines in lung US images. Kulhare et al [22] adopted the convolutional neural networks to detect the abnormalities in lung US images.
In recent years, deep learning-based methods have become the benchmark in a wide range of image processing tasks, such as image segmentation [23,24], object detection [25], image classification [26] and image registration [27]. Gomariz et al proposed a fully convolutional Siamese network to learn the similarity between image patches related to the same landmark, where a temporal consistency model was built for regularization [28]. Huang et al used an attention-aware fully convolutional neural network to identify a suspect region and employed a convolutional long short-term memory (LSTM) to integrate temporal consistency in 3D US sequences [4]. Then, Huang et al used a machine-learning based approach to generate subject-specific motion pattern and updated the template image using the learned motion pattern to reduce the search region [29]. Liu et al proposed a one-shot deformable convolutional modal for enhancing the robustness to appearance variation in a meta-learning manner and combined this modal with a cascaded Siamese structure to enhance pixel-level tracking performance [30]. They also adopted an unsupervised training strategy to reduce the risk of overfitting on a limited sample of medical images. However, these methods are often lost in the similar image structures out of interest regions or missing the temporal features between frames. Dai et al [31] developed a Markov-like network, which is implemented via generative adversarial networks, to extract features from sequential US frames and thereby estimate a set of deformation vector fields (DVFs) through the registration of the tracked frame and the untracked frames. Finally, they determined the positions of the landmarks in the untracked frames by shifting landmarks in the tracked frame according to the estimated DVFs [32].
Mask region-based convolutional neural network (mask R-CNN) is a popular deep learning approach [25], which has achieved state-of-the-art performance in object detection [33,34]. This study aims to attempt the mask R-CNN framework on the task of landmark tracking in 3D US imaging of the liver, where both advantages in these above methods are considered. The reason why we chose the popular mask R-CNN for motion tracking is that there are multiple landmarks with different image structures in an image frame. Besides, R-CNN has a capability of handling abrupt motions due to its less dependency of contexts. We set out to address two limitations of mask R-CNN when applied to this task: (a) incorrect proposals due to the structural similarity in local US image structures, as the cyan ellipses in figure 1; (b) limited single-frame scope that fails to exploit the temporal relationship between successive frames in a US image sequence. To address these limitations, an attention network is designed to focus the ROI of landmark, while a LSTM network is integrated to recognize landmark motion continuity. In the present study, we evaluate the proposed method on liver anatomic landmarks. Our methods and dataset are presented in detail in section 2, experimental results in section 3, followed by discussion and conclusions in section 4.

Patient dataset
In this study, we used data provided for the medical image computing and computer assisted interventions 2015 Challenge on Liver Ultrasound Tracking (CLUST). The CLUST dataset is composed of 2D US image sequences acquired from 63 patients under a free-breathing protocol using 5 US scanners and 6 transducers. Based on the scanners used for acquisition, the US image sequences were divided into CIL, ETH, ICR, MED1 and MED2 groups, referring to the institutional image source [35]. The duration of the sequence's ranges from 4 s to 10 min. The temporal resolution ranges from 6 Hz to 30 Hz, and the spatial resolution ranges from 0.27 mm × 0.27 mm to 0.77 mm × 0.77 mm. Anatomic landmarks were annotated on 10%-13% of image frames per sequence, the number of landmarks range from one to five per sequence. These were manually annotated by three experienced observers; the resulting mean positions are used in this study. The 63 sequences were randomly divided into a training set of 24 sequences (40% of the dataset) and a test set of 39 sequences (60% of the dataset), yielding 53 landmarks for training and 85 landmarks for testing. For the test data, the annotation of landmarks in the first frame was provided to the trained network to track their positions in subsequent frames. Two example image frames are shown in figure 1. The regions indicated by cyan ellipse have a similar image pattern with the ground-truth landmark.

Attention mask R-CNN with LSTM
The proposed deep model for landmark tracking is composed of three networks: an attention network, a mask R-CNN network and a LSTM network. The attention network learns a mapping from an US image to the attention area, where the landmark motion occurs, to reduce the search image region. The mask R-CNN network produces multiple ROI proposals for the landmarks in this region and identifies the landmark via three network heads: a bounding box regression, a proposal classification, and a mask segmentation. The LSTM network utilizes the US image sequence to model the temporal relationship between successive frames to assist bounding box regression and proposal classification. We integrated the three modules into an end-to-end deep learning architecture, as shown in figure 2.

Attention network.
The attention network aims to identify an attention ROI where the target landmark appears in order to reduce the search region considered by the subsequent mask R-CNN. Note that our attention network, whose aim is to learn a region box, is different from the work of Huang that adopted a pyramid attention network to learn the attention features [36]. Since the landmark in the first frame is provided, we predefine a box that contains all its potential positions in the subsequent frames of the US sequence. A box of size 100 × 100 pixels centered on the landmark position in the first frame was selected as input for the attention network. To train this network, a box size of H × W pixels centered at the landmark of the target frame was designated as the ground truth. We extracted pairs of input and output boxes to learn the mapping that automatically reduces the search region. The number of pairs corresponds to all the landmarks provided in the training dataset. This box regression is made under the following objective function and the intersection over union: where U and V are the input image patch and the output ground-truth patch, respectively; N is the number of intersection pixels; t i and o i are the bounding box parameters of the target and the output, defined by: where x, y, w and h are the coordinates of the box center, width and height, and x, x a and x * are the predicted box, the anchor box of a sliding window [37] and the ground-truth box. Each anchor box, centered at the sliding window used in Mask R-CNN, is used as a reference to yield more region proposals. We adopted sliding-windows of different sizes to yield the proposal anchor boxes that have greater than 0.7 overlaps with the predefined box. After the attention network, the regions including irrelevant areas or ambiguous image areas are removed, as in figure 1, and greatly reducing the search cost within the target box.

Improved mask R-CNN (IMask R-CNN).
Mask R-CNN is adopted to detect the bounding box centered at the target landmark [25], which is achieved by three training objectives: a bounding box regression L box , a proposal classification L cls and a pixel classification L mask [33]: where ω = [ω 1 , ω 2 , ω 3 ] was set to [0.2, 0.2, 0.6] in our studies to emphasize the bounding box regression. While standard mask R-CNN utilizes the softmax function for multi-object classification, object classification is formulated as a binary classification task by separating boxes containing or not containing a landmark. In addition to being better suited to a binary classification task, the large margin function proposed by Elsayed et al [38] has shown better performance compared with the softmax function. Therefore, we employed the large-margin loss function: where l is the lth layer of the classification net, γ indicates the decision boundary, and a small perturbation ε = 1 × 10 −6 avoids the numerical instability. f o (x i ) and f t (x i ) are the scores classifying x i into the class o and its ground truth label t respectively. Equation (3) collects the margin losses at all layers for deep supervision [33].
Since mask images are composed of binary pixels, pixel classification is thus a binary problem. To improve pixel classification performance, we employed a large margin softmax loss initially proposed for binary pixel classification for needle localization in brachytherapy applications [39]: where y i is the ground truth label of x i , j ∈ {0, 1} indicates a non-landmark pixel or a landmark pixel respectively, W is the weights of the last fully connected layer, and φ (θ) is defined as [39]: where m is an integer that is closely related to the classification margin, and k ∈ [0, m − 1] is an integer. For bounding box regression, we used center localization with a 20 pixels × 20 pixels bounding box [33]. The objective function is: where N is the number of samples in a mini-batch and L (u) is the robust loss function [25] as:

LSTM network.
In an US image sequence, landmarks move continuously along successive frames, thus landmark position information obtained from one frame may improve localization in subsequent frames [6,40]. The power of LSTM networks for encoding context information and capturing temporal dependencies have been previously demonstrated [41].
The key components of the LSTM architecture providing this representational power are a memory cell that maintain its state over time and non-linear gating units which regulate information flow into and out of the cell. For each frame in an US sequence, each layer of the LSTM network computes h t at frame t as, where h t , c t , and x t are the hidden state, cell state and input at frame t respectively, h t−1 is the hidden state of the layer at frame t − 1 or the initial hidden state, and i t , f t , g t , o t are the input, forget, cell and output gates respectively. σ is the sigmoid function and ⊙ is the Hadamard product. The gates are used to transmit information from image frame to the next. In this study, LSTMs were adopted to capture temporal features for each proposal, similar to Huang et al [4]. During training, several proposals with the greatest overlap with the anchor box were considered to learn the gates in their corresponding cells. A similar LSTM network was used by Lei et al [42]. In the testing stage, proposals were fed into the corresponding LSTM to obtain the features, followed by fully connected networks for bounding box regression and object classification. Besides, we used LSTM for bounding regression and proposal classification but not for pixel classification to mitigate these bad effects from abrupt motion, such as cough and sneezing.

Similarity-based localization selection
The proposed approach (as shown in figure 2) often delivers multiple localizations with different scores for each landmark. However, the predicted localization with the highest score may correspond to confounding image structures that is similar to the target landmark. Since the movement of landmarks is smooth and continuous across successive frames, the landmark position at frame t can be inferred using the landmark position at frame t − 1, based on the similarity between sequential frames. To consider both the score from IMask R-CNN and this prior, we improved our scoring schema by accounting for the distance of the landmark's predicted location to its location in the previous frame: where x k is one localization with a score S k of all predictions at frame t, and x t−1 is the landmark at frame t − 1. γ is the trade-off parameter set to 0.5. Equation (6) is composed of both mask R-CNN scores and the distances to the landmark at frame t − 1. The combined score ranges from 0 to 1 and their contributions are controlled by γ, improving localization in our experiments.

Model training and evaluation
To train the proposed model, we set the learning rate to 1 × 10 −6 , five continuous proposals for LSTMs and terminated the training at 1000 epochs where the decrease of the training error between two epochs is less than 1 × 10 −3 . To evaluate our method on the training set of 24 US sequences, we used five-fold cross validation that divides the 24 sequences into five data subsets, i.e. four subsets with five sequences per subset and one subset of four sequences. In each fold, we trained our model using four subsets and then tested on the remaining subset. To evaluate the model on the test data of 39 sequences, we trained the model using all 24 training sequences before submitting the results to CLUST organizer.
Since the targets and the predictions are two landmark points, we as usual computed the tracking error by using the Euclidean distance between the prediction landmarks and the groundtruth landmarks as follows: where t i is the ground truth position and o i is the predicted position for a landmark in the ith frame. The individual errors were calculated on all frames containing ground truth landmarks. Then, all errors for a landmark were used to compute the average error and standard deviation for each landmark. Finally, the average tracking error, standard deviation and the 95th error percentile were calculated for each patient. The computation cost was assessed to show the real-time imaging capability in US-guided radiation therapy. Figure 3 shows tracking results on the US image sequences of five patients. Each was captured with a different scanner and contains one to five annotated landmarks. The tracking errors averaged over all frames are 0.79 ± 0.43 mm for patient CIL-02, 0.67 ± 0.58 mm for CIL-01, 0.71 ± 0.46 mm for ICR-01, 0.52 ± 0.31 mm for CIL-04, and 0.74 ± 1.14 mm for MED-02-3. As shown, the proposed method identified all landmarks on the five sequences regardless of the number of landmarks. A big mean error of 0.74 ± 1.14 mm is observed on the landmark within a shadow of patient MED-02-3, because the bounding box suffers from unsteadiness in the shadow. The tracking error is potentially caused by the inconsistency between landmark patterns and manual labels. Figure 4 shows the mean error and standard deviation of tracking errors obtained on all image frames for each of the 24 training sequences, using five-fold cross validation. As shown, the mean tracking errors of all sequences are within 1 mm. For the sequences of ETH-01-1, ETH-02-1, ETH-02-2 and MED-02-3, large standard deviations are observed due to shadows surrounding these landmarks in the US images. Because a large shadow causes a freedom for the bounding box localization.

Quantitative evaluation
The tracking errors on all landmarks is less than 2 mm, within acceptable clinical range [4,43]. The minimum tracking error of 0.37 ± 0.19 mm was observed for ETH-01-1, while the maximum of 0.88 ± 0.75 mm was observed for MED-04-1. Besides, the confidence intervals with the level of 0.99 are also calculated on each US sequence, plotted along with the mean tracing errors in figure 4. The results illustrate the proposed model delivers small swings with a high confidence level, showing its robustness in most cases. Table 1 summarizes the evaluation results summarized on all 24 US sequences in terms of the mean, standard deviation, and 95th error percentile. The best performance was achieved in ICR sequences relative to other sources. On this training dataset, composite error of 0.65 ± 0.56 mm is reported for all sequence sources across all landmarks.
To validate the effectiveness of the proposed net components, we removed the LSTM net from the proposed model, called WAN (with attention) for short, and removed the attention net from the proposed model, called WLSTM (with LSTM) for short, respectively. Table 2 summarizes the results on the training data with the proposed net components. From the results, the proposed model achieves better results in comparison with both WAN and WLSTM. Therefore, our method benefits from the attention mechanism and the LSTM net, and therefore yields effective tracking results.

Evaluation on test landmarks
We evaluated the proposed model on 69 test landmarks from the test dataset provided by this challenge organizer 6 . These test landmarks all have the similar image structure with the landmarks used for model training, where the image structure is shown in figure 3. Table 3 lists the evaluation results of these related methods. From the results, the proposed method achieves the better performance than other detection-based models, i.e. Nouri's model [8] and Gomariz's Model [28]. Figure 5 shows the error distribution of the 69 test landmarks. There are 47 landmarks whose tracking errors are within 1 mm, and 15 landmarks whose tracking errors are in the range of [1 mm, 2 mm). Figure 6 shows the results of landmark localizations on an image sequence from the test data. The five frames show that the proposed method achieved a good performance on the left landmark, while obtained a weak localization on the right    Bold represents better performance.   landmark. From observations, the left landmark has the same image structure with these landmarks used for model training, shown in figure 3. The right landmark has a greater shadow so that the localization is instable and inaccuracy.

Discussion and conclusion
Landmark tracking is one of several common non-invasive motion tracking strategies, which can be used for advanced motion management during radiation therapy. This study proposes a deep learning-based approach composed of an attention network, an IMask R-CNN, and an LSTM network to track landmarks on US image sequences. With the inclusion of an attention network, the search region can be greatly reduced.
In the focused region, our method yields ROI proposals which are then conditioned by box regression, pixel classification and object identification. To improve the performance of mask R-CNN for this binary classification, we implemented a large margin loss function with a deep supervision strategy for pixel classification and a large-margin softmax function for object identification. To exploit the temporal relationship between successive US frames, we also integrated an LSTM network into the mask R-CNN. Our experiments were conducted on the CLUST 2015 dataset, which was divided into a training set of 24 US image sequences and a test set of 39 sequences. Five-fold cross validation on the training dataset shows that our proposed method achieves an average error of 0.65 ± 0.56 mm with a 95% percentile tracking error of 1.57 mm. On the test set, evaluation results from the CLUST organizer shows that our method achieved 0.94 ± 0.83 mm on 69 test landmarks of the same with the training image structure. Besides, our proposed method could handle 47-81 frames s −1 , depending on the number of landmarks in the US image sequences. Comparing with other segmentation-based models, our method obtains the better performance on localization accuracy.
However, the proposed model has a limitation on tracking the landmarks with an unseen image structure, like other segmentation-based models [28]. Figure 7 shows different landmark structures in the test dataset, where structure (a) is same with these structures in the training dataset shown in figure 3. Image structures (b), (c) are out of the learned model. There are nine images and seven images on structure (b) and structure (c), where our model achieves 4.53 ± 2.16 mm and 8.81 ± 5.37 mm respectively. Thirteen 13 images of these 16 images were given a more than 3 mm localization errors.
On the other hand, the registration-based method has reached a tracking result of 0.69 mm on the CLUST test dataset [30]. While our model is less accurate than the current best method, our model has already acceptable performance on the localization accuracy and the tracking speed for a real-time clinical treatment system [4,44]. Compared with the convolutional LSTM model [4], our method has a filter component in the backbone of mask R-CNN so as to remove those localizations that lie in the similar image structures to target landmarks. Besides, our method is potential to be improved by training on more data and various landmark image structures. That is, the deep learning-based model is data hungry.
To have more evaluations, we conducted the extra experiments on the public dataset of cardiac acquisitions for multistructure ultrasound segmentation (CAMUS) that was used in our previous study [31]. The CAMUS dataset includes 2D US images from 450 patients and meanwhile contains expert annotations in the left atrium. With the same evaluation settings to our previous study, the proposed method achieves a comparable mean tracking error of 0.52 ± 1.33 mm while is more suitable for real-time landmark tracing than the previous generative adversarial network (GAN) model-based method.
In summary, this study makes an attempt on deep learningbased method for real-time landmark tracking, resulting in a high accuracy with a small standard deviation on the landmarks known by the model. The evaluation accuracy is less accurate than the best method in the field of US landmark tracking, but our method is based on the latest deep-learning techniques and leaves a lot of space to improve. Besides, the proposed method allows whole image input so that tracking becomes a simple mapping operation, while the performance is acceptable for real-time clinical applications.
In future works, a U-Net could be integrated in the output head for pixel classification to exploit coarse and fine image feature scales. In addition to exploiting 3D features, U-Net might enhance the stability of localization results across image frames. One might also incorporate a registration network to further improve performance [45]. Another consideration is integrating the model of B-model and the US priors [21] of physical property to deal with the spurious landmarks that are shown in figure 7. Overall, this study presents a potential method to address the problem of motion tracking for implementation in a real-time clinical system, suggesting advantages and disadvantages on the use of mask R-CNN.